Regulatory variants associated with COPD
Chronic Obstructive Pulmonary Disease (COPD) is a lung condition characterized by breathing difficulties. Previously it was studied the gene expression profiles of this disease and published in PulmonDB database. But it was unknown which variants cause the disease, which is the main objective, where I found the variants that change the expression of the previously analyzed genes.
To do that, I focused on retrieving the variants previously annotated for COPD genes and the disease, then analyzing them with Regulatory Sequence Analysis Tools (RSAT) to retrieve transcriptional factor motifs where the variants are affecting. Then I searched for genes close to the variants and filtered for those in a meta-analysis (realized by Ana Beatriz Villaseñor), which complemented the data from PulmonDB.
Figure 1:Flowchart of the tools used.
Retrieve variants
For the variant search, I used a database created by Lucia Ramirez, which included data from the Genome-Wide Association Study (GWAS), ClinVar, Genotype-Tissue Expression (GTEx) project, and the Single Nucleotide Polymorphism Database (dbSNP). I searched for variants that were associated with the disease and the genes from the meta-analysis from Ana B.
Disease associated variants
I started filtering the variants related to COPD, FEV1, FVC and Emphysema. Then, I created a file where I kept the variant ID, chromosome name, and the position. I retrieved 4,370 variants from here.
# I realized the filter with grep, using -E, which determines that I will be applying regular expression.
# Also, I sorted my results with sort, using the parameters -n (so that its order in numerical order) and -k (to specify the reference column).
# The parameter -F "\t" from awk specifies that the delimitation of columns is by tabs.
grep -E "(COPD|FVC|FEV1|Chronic Obstructive Airway Disease|Chronic Lung Injury|Lung Diseases Obstructive|Emphysema)" Data/Gwas_Clinvar_Gtex_dbSNP_Joined.bed | awk -F"\t" '{OFS=FS}{if ($3 == "NA") $3 = $20}{if ($15 == "NA") $15 = $22}{if ($16 == "NA") $16 = $4}{if ($3 != "NA") print $1 "\t" $2 "\t" $3 "\t" $15 "\t" $16}' | sort -nk 2 | uniq > Data/disease-associated-variants.txtGene associated variants
To retrieve the variants associated to the genes, first I created a file with all the genes from the meta-analysis, with the corresponding filters (Tissue = qval.random <= 0.05 & I2 < 0.4 & num_exp >= 9 and blood = qval.random <= 0.05 & I2 < 0.4 & num_exp >= 4). I obtained 1,716 unique genes (161 from tissue and 1570 from blood).
Tissue_meta_analysis_f <- filter(Tissue_meta_analysis, qval.random <= 0.05 & I2 < 0.4 & num_exp >= 9)
Blood_meta_analysis_f <- filter(Blood_meta_analysis, qval.random <= 0.05 & I2 < 0.4 & num_exp >= 4)
write.csv(unique(c(Tissue_meta_analysis_f$X, Blood_meta_analysis_f$X)), "Data/meta-genes.txt", row.names = F, col.names = F, quote = F)I searched for the variants associated with the final genes from the meta-analysis (1,716). I retrieved 70,134 variants from here.
# In this case, I use grep with -w (retrieves variants only if the whole word matches) and -f (specifies that it will be using a file in the search), and then I create the same file as with the diseases.
grep -wf Data/meta-genes.txt Data/Gwas_Clinvar_Gtex_dbSNP_Joined.bed | awk -F"\t" '{OFS=FS}{if ($3 == "NA") $3 = $20}{if ($15 == "NA") $15 = $22}{if ($16 == "NA") $16 = $4}{if ($3 != "NA") print $1 "\t" $2 "\t" $3 "\t" $15 "\t" $16}' | sort -nk 2 | uniq > Data/genes-associated-variants.txtVariants concatenation and filtering
I joined both of my files and end up with 74,142 variants. Then, I looked for the coordinates of the genes from UCSC genome browser and assure that the variants were close to the genes. By doing this I ended up with 59,648 variants to begin with the analysis.
Once I had all the variants ID I created the corresponding files for the analysis. I created a bed file in which I kept the chromosome name, the start position, and the end. I also needed a vcf file; I base the columns that I needed from the vcf manual. In this step, I also put on the corresponding header in each file.
cat Data/disease-associated-variants.txt Data/genes-associated-variants.txt | sort -nk 2 | uniq > Data/all_variants_NoFilter.txtall_variants_NoFilter <- read.table("Data/all_variants_NoFilter.txt")
colnames(all_variants_NoFilter) <- c("Chr", "Pos", "ID", "Ref", "Alt")
all_variants_NoFilter$Chr <- gsub("chr", "", all_variants_NoFilter$Chr)
all_variants_Filter <- unique(unlist(sapply(c(1:nrow(UCSC_genes)), function(x){return(filter(all_variants_NoFilter, (Chr == UCSC_genes$chrom[x]) & (Pos >= (UCSC_genes$cdsStart[x]-5000)) & (Pos <= (UCSC_genes$cdsStart[x]+5000)))$ID)})))
writeLines(c("##fileformat=VCFv4.2","#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"), "Data/all_variants.vcf")
write.table(data.frame(all_variants_NoFilter[all_variants_NoFilter$ID %in% all_variants_Filter,], qual = ".", filter = ".", info = "."), "Data/all_variants.vcf", row.names = F, col.names = F, quote = F, sep = "\t", append = T)
bed <- all_variants_NoFilter[all_variants_NoFilter$ID %in% all_variants_Filter, c(1, 2)]
bed$start <- bed$Pos -1
writeLines(c("chrom \t chromStar \t chromEnd"), "Data/all_variants.bed")
write.table(bed[,c(1, 3, 2)], "Data/all_variants.bed", row.names = F, col.names = F, quote = F, sep = "\t", append = T)
# Need to remove R and Y, because RSAT don't recognize IUPAC annotation. # I create a file with the JASPAR TF IDs since I'll be needing this later.
grep '>' Data/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt | tr -d '>' > Data/JASPAR_IDs.txtIdentify affected TFBS
To identify the affected TFBS. I analyze motif changes with variation tools from Regulatory Sequence Analysis Tools (RSAT).
First, I needed to create a background model, so I used fetch-sequences, which receives a BED file and collects the sequences from the UCSC genome browser. I chose to retrieve 100pb upstream and 100pb downstream from the variant. And then, given that sequence, I created my background model with the tool create-background-model.
I used convert-variations to retrieve a varBed file from a VCF; this is a format from RSAT that facilitates the processing of the information of the variants. Then I used retrieve-variation-seq, which gives you the sequences surrounding the variant (30pb). Finally, I used variation-scan, which provides you with the probability of having a new instance in the sequence from the variant given the background model binding site (created before); this is done by processing different variants with the JASPAR non-redundant motif database and comparing the scores and p-values to find the TFBS that are being affected.
# qsub varScan.sge
rsat fetch-sequences -i Data/all_variants.bed -o Results/RSAT/all_variants_bg.fasta -v 1 -downstr_ext 100 -upstr_ext 100 -genome hg38
rsat create-background-model -v 1 -i Results/RSAT/all_variants_bg.fasta -markov 2 -out_format transitions -o Results/RSAT/all_variants_bg_model.txt
rsat convert-variations -v 1 -i Data/all_variants.vcf -from vcf -to varBed -o Results/RSAT/all_variants.varBed
rsat retrieve-variation-seq -species Homo_sapiens -assembly GRCh38 -i Results/RSAT/all_variants.varBed -format varBed -mml 30 -v 1 -o Results/RSAT/all_variants.varSeq
rsat variation-scan -i Results/RSAT/all_variants.varSeq -m /cm/shared/apps/rsat/rsat/public_html/motif_databases/JASPAR/Jaspar_2020/nonredundant/JASPAR2020_CORE_vertebrates_non-redundant_pfms.tf -m_format transfac -bg Results/RSAT/all_variants_bg_model.txt -o Results/RSAT/all_variants_varScan.tsv -v 1 -mml 30 -lth score 1 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3Once I had the results from variation-Scan, I analyzed them. To do that, I filtered just the ones that fulfill quality requirements, and using the JASPAR data, I added a column with the corresponding TF from the motif.
COPD_varScan <- read.csv("Results/RSAT/all_variants_varScan.tsv", header = T, sep = "\t", comment.char = ";") %>% filter(pval_ratio > 10 & (best_w == 0 | worst_w == 0 | (best_w * worst_w) < 0) & best_pval < 1e-4 ) %>% merge(JASPAR_IDs, by = "X.ac_motif", all.x = T)
COPD_varScan <- COPD_varScan[,c(1, 22, 2:21)]To make sure that my variants were affecting the TF, I realized two different negative controls. One with vSampler and the other one with matrix permutations, both explained later. And I compute both of them for tissue and blood data independently.
Tissue negative control
For the tissue, I used the Tissue meta-analysis with the filtered previously mentioned. From them, I searched for the TFs on this data-set tissue data.
Tissue_COPD_varScan <- COPD_varScan[COPD_varScan$TF %in% Tissue_meta_analysis$X,]vSampler
vSampler is a tool that receives the list of variants you’re using, and for each one of them retrieves random variants given eight different properties. This is used to compute a negative control.
To use vSampler, I needed to create a file with the chromosome name and the variant’s position. For that, I used only the significant ones from variationScan and created the input file.
Tissue_coords <- as.data.frame(matrix(unlist(str_split(unlist(str_split(unlist(str_split(Tissue_COPD_varScan$var_coord, ":")), "-")), "_")), ncol = 4, byrow = T))[,(1:2)]
write.table(Tissue_coords, file = "Data/Tissue_vSampler_Coord-Only.txt", append = F, sep = "\t", row.names = F, col.names = F, quote = F)
rm(Tissue_coords)I ran vSampler online. And given the result list of variants, I created the vcf file and the bed file. Then, I ran all the same steps as before from RSAT-tools.
grep control1 Data/Tissue_vSampler/anno.out.txt | awk '{print $2 "\t" $3-1 "\t" $3}' | sed -e '1 i\chrom \t chromStar \t chromEnd' > Data/TissueNC_vSampler.bed
grep control1 Data/Tissue_vSampler/anno.out.txt | awk '{print $2 "\t" $3 "\t" "." "\t" $4 "\t" $5 "\t" "." "\t" "." "\t" "."}' | sed -e '1 i\##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > Data/TissueNC_vSampler.vcf
rsat fetch-sequences -i Data/TissueNC_vSampler.bed -o Results/RSAT/TissueNC_vSampler_bg.fasta -v 1 -downstr_ext 100 -upstr_ext 100 -genome hg38
rsat create-background-model -v 1 -i Results/RSAT/TissueNC_vSampler_bg.fasta -markov 2 -out_format transitions -o Results/RSAT/TissueNC_vSampler_bg_model.txt
rsat convert-variations -v 1 -i Data/TissueNC_vSampler.vcf -from vcf -to varBed -o Results/RSAT/TissueNC_vSampler.varBed
rsat retrieve-variation-seq -species Homo_sapiens -assembly GRCh38 -i Results/RSAT/TissueNC_vSampler.varBed -format varBed -mml 30 -v 1 -o Results/RSAT/TissueNC_vSampler.varSeq
rsat variation-scan -i Results/RSAT/TissueNC_vSampler.varSeq -m /cm/shared/apps/rsat/rsat/public_html/motif_databases/JASPAR/Jaspar_2020/nonredundant/JASPAR2020_CORE_vertebrates_non-redundant_pfms.tf -m_format transfac -bg Results/RSAT/TissueNC_vSampler_bg_model.txt -o Results/RSAT/TissueNC_vSampler_varScan.tsv -v 1 -mml 30 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3Matrix permutation
I retrieved the Jaspar PFM matrix from the previously filtered TFs and then saved it into a file.
Jaspar_Tissue_PFMatrix <- getMatrixByID(JASPAR2020, ID = Tissue_COPD_varScan$X.ac_motif)
write_transfac(Jaspar_Tissue_PFMatrix, "Data/TissueNC_PFMatrix.txt", overwrite = T)
rm(Jaspar_Tissue_PFMatrix)Again I used some tools from RSAT to perform the negative control. To do that, I created five permutations of my matrices and then ran again variation-scan with the new matrix file.
# Because the transfac format has the name of the motif in AC, in order to function I needed to correct that in the matrix file.
cat Data/TissueNC_PFMatrix.txt | sed -e 's/NA/AC/' > Data/TissueNC_PFMatrix.tf
rm Data/TissueNC_PFMatrix.txt
rsat permute-matrix -v 0 -in_format transfac -i Data/TissueNC_PFMatrix.tf -perm 5 -out_format transfac -o Results/RSAT/TissueNC_5perm_transfac.tf
rsat variation-scan -i Results/RSAT/all_variants.varSeq -m Results/RSAT/TissueNC_5perm_transfac.tf -m_format transfac -bg Results/RSAT/all_variants_bg_model.txt -o Results/RSAT/tissueNC_5perm_varScan.tsv -v 1 -mml 30 -lth score 1 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3Tissue negative control distribution analysis
To scan both variation-scan results from both of my controls independently, I calculated the distribution of each motif’s pval ratio, which is the ratio between the worst pval and the best one (the bigger the ratio, the most significant it will be). Then, I saw the distribution on the control motif for each variant and kept just the significant ones.
# I started by loading the results from variation-scan from vSampler and created the corresponding data to create the distribution.
TissueNC_vSampler_varScan <- read.csv("Results/RSAT/TissueNC_vSampler_varScan.tsv", header = T, sep = "\t", comment.char = ";")
TissueNC_vSampler_varScan <- merge(TissueNC_vSampler_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
TissueNC_vSampler_varScan <- TissueNC_vSampler_varScan[,c(1, 22, 2:21)]
# I did the same with the results from variation-scan of the permutations.
TissueNC_5perm_varScan <- read.csv("Results/RSAT/TissueNC_5perm_varScan.tsv", header = T, sep = "\t", comment.char = ";")
TissueNC_5perm_varScan$perm <- unlist(as.data.frame(matrix(unlist(str_split(TissueNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[2])
TissueNC_5perm_varScan$X.ac_motif <- unlist(as.data.frame(matrix(unlist(str_split(TissueNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[1])
TissueNC_5perm_varScan <- merge(TissueNC_5perm_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
TissueNC_5perm_varScan <- TissueNC_5perm_varScan[,c(1, 23, 2:22)]
# Using some functions created before I realize the distributions.
Tissue_vSampler_distribution <- unlist(sapply(X = unique(Tissue_COPD_varScan$var_id), function(x){distribution_analysis(x, Tissue_COPD_varScan, TissueNC_vSampler_varScan)}))
Tissue_vSampler_distribution <- Tissue_vSampler_distribution[Tissue_vSampler_distribution < 0.05]
Tissue_vSampler_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Tissue_vSampler_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Tissue_vSampler_distribution))
colnames(Tissue_vSampler_distribution) <- c("var_id", "TF", "pvalue")
Tissue_perm_distribution <- unlist(sapply(X = unique(Tissue_COPD_varScan$var_id), function(x){distribution_analysis(x, Tissue_COPD_varScan, TissueNC_5perm_varScan)}))
Tissue_perm_distribution <- Tissue_perm_distribution[Tissue_perm_distribution < 0.05]
Tissue_perm_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Tissue_perm_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Tissue_perm_distribution))
colnames(Tissue_perm_distribution) <- c("var_id", "TF", "pvalue")
# Filter from my original data just the ones that were significant on the distributions given my negatives controls.
Tissue_significant_var <- unique(rbind(Tissue_perm_distribution, Tissue_vSampler_distribution)[c("var_id", "TF")])
Tissue_significant_varScan <- merge(Tissue_significant_var, Tissue_COPD_varScan, by = c("var_id", "TF"))
Tissue_significant_varScan <- Tissue_significant_varScan[,c(3, 2, 1, 4:22)]
rm(TissueNC_vSampler_varScan)
rm(TissueNC_5perm_varScan)
rm(Tissue_vSampler_distribution)
rm(Tissue_perm_distribution)Blood negative control
I used the blood meta-analysis realized by Ana B and filtered just the TFs on the blood data.
Blood_COPD_varScan <- COPD_varScan[COPD_varScan$TF %in% Blood_meta_analysis$X,]vSampler
Here I did the same as with the tissue data, so I created the corresponding file with the variant’s position and ran vSampler online.
Blood_coords <- as.data.frame(matrix(unlist(str_split(unlist(str_split(unlist(str_split(Blood_COPD_varScan$var_coord, ":")), "-")), "_")), ncol = 4, byrow = T))[,(1:2)]
write.table(Blood_coords, file = "Data/Blood_vSampler_Coord-Only.txt", append = F, sep = "\t", row.names = F, col.names = F, quote = F)
rm(Blood_coords)Given my result list of variants, I created the vcf file and the bed file. Then, I ran all the steps from RSAT-tools.
grep control1 Data/Blood_vSampler/anno.out.txt | awk '{print $2 "\t" $3-1 "\t" $3}' | sed -e '1 i\chrom \t chromStar \t chromEnd' > Data/BloodNC_vSampler.bed
grep control1 Data/Blood_vSampler/anno.out.txt | awk '{print $2 "\t" $3 "\t" "." "\t" $4 "\t" $5 "\t" "." "\t" "." "\t" "."}' | sed -e '1 i\##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > Data/BloodNC_vSampler.vcf
rsat fetch-sequences -i Data/BloodNC_vSampler.bed -o Results/RSAT/BloodNC_vSampler_bg.fasta -v 1 -downstr_ext 100 -upstr_ext 100 -genome hg38
rsat create-background-model -v 1 -i Results/RSAT/BloodNC_vSampler_bg.fasta -markov 2 -out_format transitions -o Results/RSAT/BloodNC_vSampler_bg_model.txt
rsat convert-variations -v 1 -i Data/BloodNC_vSampler.vcf -from vcf -to varBed -o Results/RSAT/BloodNC_vSampler.varBed
rsat retrieve-variation-seq -species Homo_sapiens -assembly GRCh38 -i Results/RSAT/BloodNC_vSampler.varBed -format varBed -mml 30 -v 1 -o Results/RSAT/BloodNC_vSampler.varSeq
rsat variation-scan -i Results/RSAT/BloodNC_vSampler.varSeq -m /cm/shared/apps/rsat/rsat/public_html/motif_databases/JASPAR/Jaspar_2020/nonredundant/JASPAR2020_CORE_vertebrates_non-redundant_pfms.tf -m_format transfac -bg Results/RSAT/BloodNC_vSampler_bg_model.txt -o Results/RSAT/BloodNC_vSampler_varScan.tsv -v 1 -mml 30 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3Matrix permutation
Just as with tissue, I retrieved the Jaspar PFM matrix and then ran the corresponding tools from RSAT.
Jaspar_Blood_PFMatrix <- getMatrixByID(JASPAR2020, ID = Blood_COPD_varScan$X.ac_motif)
write_transfac(Jaspar_Blood_PFMatrix, "Data/BloodNC_PFMatrix.txt", overwrite = T)
rm(Jaspar_Blood_PFMatrix)# Because the transfac format has the name of the motif in AC, in order to function I needed to correct that in the matrix file.
cat Data/BloodNC_PFMatrix.txt | sed -e 's/NA/AC/' > Data/BloodNC_PFMatrix.tf
rm Data/BloodNC_PFMatrix.txt
rsat permute-matrix -v 0 -in_format transfac -i Data/BloodNC_PFMatrix.tf -perm 5 -out_format transfac -o Results/RSAT/BloodNC_5perm_transfac.tf
rsat variation-scan -i Results/RSAT/all_variants.varSeq -m Results/RSAT/BloodNC_5perm_transfac.tf -m_format transfac -bg Results/RSAT/all_variants_bg_model.txt -o Results/RSAT/BloodNC_5perm_varScan.tsv -v 1 -mml 30 -lth score 1 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3Blood negative control distribution analysis
I ran the same functions but with my blood results to analyze the distribution on the control motif for each variant and kept just the significant ones.
# I started by loading the results from variation-scan from vSampler and created the corresponding data to create the distribution.
BloodNC_vSampler_varScan <- read.csv("Results/RSAT/BloodNC_vSampler_varScan.tsv", header = T, sep = "\t", comment.char = ";")
BloodNC_vSampler_varScan <- merge(BloodNC_vSampler_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
BloodNC_vSampler_varScan <- BloodNC_vSampler_varScan[,c(1, 22, 2:21)]
# I did the same with the results from variation-scan of the permutations.
BloodNC_5perm_varScan <- read.csv("Results/RSAT/BloodNC_5perm_varScan.tsv", header = T, sep = "\t", comment.char = ";")
BloodNC_5perm_varScan$perm <- unlist(as.data.frame(matrix(unlist(str_split(BloodNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[2])
BloodNC_5perm_varScan$X.ac_motif <- unlist(as.data.frame(matrix(unlist(str_split(BloodNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[1])
BloodNC_5perm_varScan <- merge(BloodNC_5perm_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
BloodNC_5perm_varScan <- BloodNC_5perm_varScan[,c(1, 23, 2:22)]
# Using some functions created before I realize the distributions.
Blood_vSampler_distribution <- unlist(sapply(X = unique(Blood_COPD_varScan$var_id), function(x){distribution_analysis(x, Blood_COPD_varScan, BloodNC_vSampler_varScan)}))
Blood_vSampler_distribution <- Blood_vSampler_distribution[Blood_vSampler_distribution < 0.05]
Blood_vSampler_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Blood_vSampler_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Blood_vSampler_distribution))
colnames(Blood_vSampler_distribution) <- c("var_id", "TF", "pvalue")
Blood_perm_distribution <- unlist(sapply(X = unique(Blood_COPD_varScan$var_id), function(x){distribution_analysis(x, Blood_COPD_varScan, BloodNC_5perm_varScan)}))
Blood_perm_distribution <- Blood_perm_distribution[Blood_perm_distribution < 0.05]
Blood_perm_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Blood_perm_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Blood_perm_distribution))
colnames(Blood_perm_distribution) <- c("var_id", "TF", "pvalue")
# Filter from my original data just the ones that were significant on the distributions given my negatives controls.
Blood_significant_var <- unique(rbind(Blood_perm_distribution, Blood_vSampler_distribution)[c("var_id", "TF")])
Blood_significant_varScan <- merge(Blood_significant_var, Blood_COPD_varScan, by = c("var_id", "TF"))
Blood_significant_varScan <- Blood_significant_varScan[,c(3, 2, 1, 4:22)]
rm(BloodNC_vSampler_varScan)
rm(BloodNC_5perm_varScan)
rm(Blood_vSampler_distribution)
rm(Blood_perm_distribution)Gene association
Given my results I associated each variant with the closest gene given the meta-analysis realized by Ana B.
I’ll be using the Transcripts per Million (TPM) from the Genotype-Tissue Expression (GTEx) to select just the TFs that are active in normal conditions. So the next code is to select the data that I’ll be using later on.
awk -F "\t" '{print $1 "\t" $2 "\t" $39 "\t" $56}' GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct > GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm_lung_blood.csvVariant Effect Predictor
To do the gene search I used Ensembl Variant Effect Predictor (VEP) from ensembl and I looked for 5000 pb upstream and 1000 pb downstream from the variant.
# Here I create a file with the variant id so that I can create a vcf file just with the variants that I'll be using for my tissue and blood data.
write.table(unique(Tissue_significant_varScan$var_id), "Results/Tissue_significant_varID_varScan.txt", sep = "\t", quote = F, col.names = F, row.names = F)
# grep -wf Results/Tissue_significant_varID_varScan.txt Data/all_variants.vcf > Results/Tissue_significant_varID_varScan.vcf
write.table(unique(Blood_significant_varScan$var_id), "Results/Blood_significant_varID_varScan.txt", sep = "\t", quote = F, col.names = F, row.names = F)
# grep -wf Results/Blood_significant_varID_varScan.txt Data/all_variants.vcf > Results/Blood_significant_varID_varScan.vcfI ran VEP online and download my results in R.
Tissue VEP results
To know if the effect is really happening between the genes and the TFs, I search in GTEx the Transcripts Per Million (TPM) of lung and blood and remove all the genes and TFs that had a TPM of 0, from my results of VEP.
Tissue_VEP <- read.csv("Results/Tissue_VEP.txt", sep = "\t")
Tissue_VEP <- Tissue_VEP[Tissue_VEP$SYMBOL %in% Tissue_meta_analysis_f$X,]
Tissue_data <- merge(data.frame(merge(Tissue_meta_analysis, Tissue_VEP, by.x = "X", by.y = "SYMBOL" )), Tissue_significant_varScan, by.x = "X.Uploaded_variation", by.y = "var_id")
Tissue_data$TF.TE.random <- sapply(Tissue_data$TF, function(x){return(as.numeric(filter(Tissue_meta_analysis, X == x)$TE.random))})
Tissue_data <- Tissue_data[,c(1, 59, 53, 50, 51, 71, 2, 9, 3, 5, 6, 12:14, 19, 20, 29, 41)]
Tissue_data <- unique(Tissue_data)
colnames(Tissue_data) <- c("var_id", "varScan_pval.ratio", "var_coord", "motif", "TF", "TF_TE.random", "gene", "gene_qval.random", "gene_TE.random", "gene_tau2", "gene_I2", "ALT", "Consequence", "IMPACT", "EXON", "INTRON", "DISTANCE", "CLIN_SIG")
# GTEx filter
GTEx_tissue_data <- rbind(GTEx_data[GTEx_data$Description %in% Tissue_data$TF,], GTEx_data[GTEx_data$Description %in% Tissue_data$gene,])
Tissue_data <- Tissue_data[Tissue_data$TF %in% GTEx_tissue_data[GTEx_tissue_data$Lung != 0,]$Description,]
Tissue_data <- Tissue_data[Tissue_data$gene %in% GTEx_tissue_data[GTEx_tissue_data$Lung != 0,]$Description,]
head(Tissue_data) %>% htmlTable| var_id | varScan_pval.ratio | var_coord | motif | TF | TF_TE.random | gene | gene_qval.random | gene_TE.random | gene_tau2 | gene_I2 | ALT | Consequence | IMPACT | EXON | INTRON | DISTANCE | CLIN_SIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rs1002699754 | 31.96 | 2:61017812-61017813_+ | MA0137.3 | STAT1 | 0.0379079596826783 | PEX13 | 0.0125637541750527 | 0.115720512184952 | 0.00308264978286661 | 0.379060316925231 | T | synonymous_variant | LOW | 1/4 | - | - | - |
| 3 | rs1002699754 | 31.96 | 2:61017812-61017813_+ | MA0137.3 | STAT1 | 0.0379079596826783 | PEX13 | 0.0125637541750527 | 0.115720512184952 | 0.00308264978286661 | 0.379060316925231 | C | synonymous_variant | LOW | 1/4 | - | - | - |
| 5 | rs1002699754 | 31.96 | 2:61017812-61017813_+ | MA0137.3 | STAT1 | 0.0379079596826783 | PEX13 | 0.0125637541750527 | 0.115720512184952 | 0.00308264978286661 | 0.379060316925231 | C | synonymous_variant | LOW | 2/3 | - | - | - |
| 7 | rs1002699754 | 31.96 | 2:61017812-61017813_+ | MA0137.3 | STAT1 | 0.0379079596826783 | PEX13 | 0.0125637541750527 | 0.115720512184952 | 0.00308264978286661 | 0.379060316925231 | T | synonymous_variant | LOW | 1/2 | - | - | - |
| 9 | rs1002699754 | 31.96 | 2:61017812-61017813_+ | MA0137.3 | STAT1 | 0.0379079596826783 | PEX13 | 0.0125637541750527 | 0.115720512184952 | 0.00308264978286661 | 0.379060316925231 | A | synonymous_variant | LOW | 1/4 | - | - | uncertain_significance |
| 11 | rs1002699754 | 31.96 | 2:61017812-61017813_+ | MA0137.3 | STAT1 | 0.0379079596826783 | PEX13 | 0.0125637541750527 | 0.115720512184952 | 0.00308264978286661 | 0.379060316925231 | A | synonymous_variant | LOW | 2/3 | - | - | uncertain_significance |
Figure 2
Figure 2: CCL2
Blood VEP results
I did the same for my blood results. Since I got a huge list from VEP that were positive, I did an extra filter. So, I looked for the coordinates of the genes from UCSC genome browser and assure that the gene was between my variant. Doing this filter helped me to remove the ones that weren’t close to the gene.
Blood_VEP <- read.csv("Results/Blood_VEP.txt", sep = "\t")
Blood_VEP <- Blood_VEP[Blood_VEP$SYMBOL %in% Blood_meta_analysis_f$X,]
Blood_data <- merge(data.frame(merge(Blood_meta_analysis, Blood_VEP, by.x = "X", by.y = "SYMBOL" )), Blood_significant_varScan, by.x = "X.Uploaded_variation", by.y = "var_id")
Blood_data$TF.TE.random <- sapply(Blood_data$TF, function(x){return(filter(Blood_meta_analysis, X == x)$TE.random)})
Blood_data <- Blood_data[,c(1, 59, 53, 50, 51, 71, 2, 9, 3, 5, 6, 12:14, 19, 20, 29, 41)]
Blood_data <- unique(Blood_data)
colnames(Blood_data) <- c("var_id", "varScan_pval.ratio", "var_coord", "motif", "TF", "TF_TE.random", "gene", "gene_qval.random", "gene_TE.random", "gene_tau2", "gene_I2", "ALT", "Consequence", "IMPACT", "EXON", "INTRON", "DISTANCE", "CLIN_SIG")
# UCSC genes
Blood_UCSC_genes <- UCSC_genes[UCSC_genes$name2 %in% Blood_meta_analysis_f$X,]
Blood_varScan_coord <- as.data.frame(matrix(unlist(str_split(unlist(str_split(unlist(str_split(Blood_significant_varScan$var_coord, ":")), "-")), "_")), ncol = 4, byrow = T))
Blood_genes <- unique(unlist(sapply(c(1:nrow(Blood_significant_varScan)), function(x){return(filter(Blood_UCSC_genes, (chrom == Blood_varScan_coord$V1[x]) & ((cdsStart-5000) <= Blood_varScan_coord$V2[x]) & ((cdsStart+5000) >= Blood_varScan_coord$V3[x]))$name2)})))
Blood_data <- Blood_data[Blood_data$gene %in% Blood_genes,]
# GTEx filter
GTEx_blood_data <- rbind(GTEx_data[GTEx_data$Description %in% Blood_data$TF,], GTEx_data[GTEx_data$Description %in% Blood_data$gene,])
Blood_data <- Blood_data[Blood_data$TF %in% GTEx_blood_data[GTEx_blood_data$`Whole Blood` != 0,]$Description,]
Blood_data <- Blood_data[Blood_data$gene %in% GTEx_blood_data[GTEx_blood_data$`Whole Blood` != 0,]$Description,]
head(Blood_data) %>% htmlTable| var_id | varScan_pval.ratio | var_coord | motif | TF | TF_TE.random | gene | gene_qval.random | gene_TE.random | gene_tau2 | gene_I2 | ALT | Consequence | IMPACT | EXON | INTRON | DISTANCE | CLIN_SIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rs10015223 | 76 | 4:4387146-4387147_+ | MA0691.1 | TFAP4 | -0.0258832292174166 | NSG1 | 0.0166089583403988 | -0.205856808798565 | 0.00156274318763709 | 0 | G | 5_prime_UTR_variant | MODIFIER | 2/6 | - | - | - |
| 2 | rs10015223 | 29.09 | 4:4387146-4387147_+ | MA0691.1 | TFAP4 | -0.0258832292174166 | NSG1 | 0.0166089583403988 | -0.205856808798565 | 0.00156274318763709 | 0 | G | 5_prime_UTR_variant | MODIFIER | 2/6 | - | - | - |
| 3 | rs10015223 | 76 | 4:4387146-4387147_+ | MA0691.1 | TFAP4 | -0.0258832292174166 | NSG1 | 0.0166089583403988 | -0.205856808798565 | 0.00156274318763709 | 0 | G | 5_prime_UTR_variant | MODIFIER | 1/5 | - | - | - |
| 4 | rs10015223 | 29.09 | 4:4387146-4387147_+ | MA0691.1 | TFAP4 | -0.0258832292174166 | NSG1 | 0.0166089583403988 | -0.205856808798565 | 0.00156274318763709 | 0 | G | 5_prime_UTR_variant | MODIFIER | 1/5 | - | - | - |
| 5 | rs10015223 | 76 | 4:4387146-4387147_+ | MA0691.1 | TFAP4 | -0.0258832292174166 | NSG1 | 0.0166089583403988 | -0.205856808798565 | 0.00156274318763709 | 0 | G | 5_prime_UTR_variant | MODIFIER | 4/8 | - | - | - |
| 6 | rs10015223 | 29.09 | 4:4387146-4387147_+ | MA0691.1 | TFAP4 | -0.0258832292174166 | NSG1 | 0.0166089583403988 | -0.205856808798565 | 0.00156274318763709 | 0 | G | 5_prime_UTR_variant | MODIFIER | 4/8 | - | - | - |
There’re 1964 variants. I ran all of them in VEP, and I obtain the same 1964 variants, associated with 966 genes. From there I first remove the ones that weren’t in our filter lood meta-analysis (1570 genes) and retrieve 427 genes. Then I did another filter with the UCSC, removing all the genes that aren’t close to the variants, we obtain 386 genes. After that, we remove the TFs that in normal contidions are not being expressed (TPM == 0), so finally we find 354 significant genes.
If I do the analysis without VEP, and only search for the genes that are close to the variants, I obtain 394 genes. The main difference is that with VEP I obtain more information about the association.
Result analysis
Now that I have all the results, I’m going to be doing the final tables to be able to interpret my results better. To do that I’ll start with some tables that will have different information all together and then some figures with the best association.
Variants and TFs for each gene (Table 1)
gene_variants_tfs_function <- function(x, data, s){
gene_data <- data %>% filter(gene == x)
return(c(x, toString(unique(gene_data$var_id)), toString(unique(gene_data$TF)), s))
}
gene_variants_tfs <- rbind(as.data.frame(t(sapply(X = unique(Tissue_data$gene), function(x){return(gene_variants_tfs_function(x, Tissue_data, "tissue"))}))), as.data.frame(t(sapply(X = unique(Blood_data$gene), function(x){return(gene_variants_tfs_function(x, Blood_data, "blood"))}))))
rownames(gene_variants_tfs) <- NULL
colnames(gene_variants_tfs) <- c("gene", "variants", "TF", "analysis")
write.table(gene_variants_tfs, file = "Results/Tables/Gene_variants_tfs.csv", row.names = F, quote = F, sep = "\t")
head(gene_variants_tfs) %>% htmlTable| gene | variants | TF | analysis | |
|---|---|---|---|---|
| 1 | PEX13 | rs1002699754, rs771610641 | STAT1, ZSCAN4 | tissue |
| 2 | HBB | rs1005042281, rs1055395965, rs1141370, rs1160543272, rs1250042245, rs1302097141, rs1320832744, rs1554917555, rs193922551, rs193922558, rs33913413, rs33919924, rs33922842, rs33922873, rs33925391, rs33926796, rs33927093, rs33929459, rs33930385, rs33931779, rs33931806, rs33932070, rs33933298, rs33937535, rs33941377, rs33944208, rs33945546, rs33946157, rs33946775, rs33951978, rs33952266, rs33952543, rs33953406, rs33954595, rs33958088, rs33958637, rs33959340, rs33961444, rs33974228, rs33974936, rs33976006, rs33978082, rs33978338, rs33985510, rs33987957, rs33990858, rs33991294, rs33991472, rs33991993, rs33994806, rs34095019, rs34126315, rs34227486, rs34515413, rs34726542, rs34999973, rs35286210, rs35328027, rs35747961, rs35890959, rs63750400, rs63750628, rs63750840, rs713040, rs760975738, rs901033731 | HNF1B, TEAD3, GCM1, HOXD10, HOXC10, HOXC9, POU2F2, ATF3, ZNF24, LHX9, LMX1A, TLX2, LHX6, EMX2, HOXB3, NKX6-1, HESX1, PRRX2, LBX2, NKX6-2, DRGX, PRRX1, MIXL1, ZEB1, STAT3, RHOXF1, ZNF75D, OSR1, ZNF274, NR6A1, BARX1, ESR2, NR3C2, HOXA2, HOXD3, HOXD4, HOXA5, HOXB5, HOXB4, HOXA4, HOXC4, NR1I3, NKX2-8, NR5A1, TFAP2E, FOSL2, ASCL1, SOHLH2, MEIS2, MEIS1, KLF5, KLF6, KLF2, YY1, TFAP2B, CTCF, TBX6, TBX3, TBX15, TBX5, ELK4, ELF4, ELF5, DMRT3, JUN, MAX, TFEB, MLXIPL, FOS, HOXD8, PPARD, RXRB, RXRG, CREB3L1, RFX4, RFX5, SNAI3, OLIG1, TFEC, HIF1A, RFX2, HOXD9, CDX1, HOXA10, GFI1, MYB, ZBTB18, ZNF135 | tissue |
| 3 | WFS1 | rs1014892217 | NR2C1, NR2F1, ESR2 | tissue |
| 4 | CENPM | rs1023497, rs5758511 | HSF2, EGR3 | tissue |
| 5 | ACADVL | rs1023643662, rs1032857886, rs1057523521, rs112406105, rs118204015, rs139425622, rs141167669, rs144255994, rs1458941582, rs1555526667, rs1555527732, rs1567565417, rs374911841, rs377044444, rs387906253, rs398123081, rs398123083, rs545215807, rs727503788, rs749159573, rs771874163, rs775761275, rs77763289, rs777751102, rs780020193, rs796051907, rs796051918, rs886044100, rs890862631 | KLF5, MAZ, HIC2, ESR2, BARHL1, RHOXF1, ASCL1, TFAP4, NKX2-8, TCF7L2, TFAP2E, GCM1, KLF2, KLF6, NR2C1, TGIF2, TBX4, TBX2, NR2F1, TBX3, TBX6, ZFP42, NR3C2, NR3C1, FIGLA, OLIG1, NR5A1, NHLH2, NHLH1, FOS, GATA5, SNAI1, ETV4, NFATC4 | tissue |
| 6 | CCL2 | rs1024611 | TFAP4 | tissue |
variant, gene and TFs association (Table 2)
variant_gene_tf <- rbind(cbind(unique(Tissue_data[,c(1, 5, 6, 7, 9)]), analysis = rep("tissue", times = nrow(unique(Tissue_data[,c(1, 5, 6, 7, 9)])))), cbind(unique(Blood_data[,c(1, 5, 6, 7, 9)]), analysis = rep("blood", times = nrow(unique(Blood_data[,c(1, 5, 6, 7, 9)])))))
write.table(variant_gene_tf, file = "./Results/Tables/Variant_gene_tf.csv", row.names = F, quote = F, sep = "\t")
head(variant_gene_tf) %>% htmlTable()| var_id | TF | TF_TE.random | gene | gene_TE.random | analysis | |
|---|---|---|---|---|---|---|
| 1 | rs1002699754 | STAT1 | 0.0379079596826783 | PEX13 | 0.115720512184952 | tissue |
| 33 | rs1005042281 | HNF1B | -0.0172428699479146 | HBB | 0.179282484897724 | tissue |
| 103 | rs1014892217 | NR2C1 | -0.0913787642762126 | WFS1 | -0.201667114660007 | tissue |
| 104 | rs1014892217 | NR2F1 | -0.100398737715935 | WFS1 | -0.201667114660007 | tissue |
| 106 | rs1014892217 | ESR2 | 0.0575311120768812 | WFS1 | -0.201667114660007 | tissue |
| 233 | rs1023497 | HSF2 | -0.0350147089725436 | CENPM | 0.0964497445898264 | tissue |
TP53 data
TP53_consequences <-as.factor(filter(Blood_data, gene == "TP53")$Consequence)
TP53_function <- function(x, datas){
return(c(x, nrow(filter(datas, Consequence == x)), length(unique(filter(datas, Consequence == x)$TF)), length(unique(filter(datas, Consequence == x)$var_id))))
}
TP53_data <- unique(filter(Blood_data, gene == "TP53")[,c(1, 5, 13)])
TP53_consequences <- as.data.frame(base::table(unlist(TP53_data$Consequence)))
TP53_other <- TP53_data[TP53_data$Consequence %in% unlist(TP53_consequences[TP53_consequences$Freq < 5,]$Var1),]
TP53_other <- data.frame(Consequence = "other", num_consequence = nrow(TP53_other), num_TFs = length(unique(TP53_other$TF)), num_var = length(unique(TP53_other$var_id)))
TP53_NoOther <- TP53_data[!TP53_data$Consequence %in% unlist(TP53_consequences[TP53_consequences$Freq < 5,]$Var1),]
TP53_NoOther <- as.data.frame(t(sapply(X = unique(TP53_NoOther$Consequence), function(x){TP53_function(x, TP53_NoOther)})))
colnames(TP53_NoOther) <- c("Consequence", "num_consequence", "num_TFs", "num_var")
TP53_consequences <- rbind(TP53_NoOther, TP53_other)
TP53_consequences <- data.frame(consequence = rep(TP53_consequences$Consequence, each = 3), num = rep(c("num_consequence", "num_TFs", "num_var"), times = nrow(TP53_consequences)), value = as.numeric(unlist(t(TP53_consequences[c(2:4)]))))
ggsave( filename = "./Results/Plots/TP53_consequences.png",
ggplot(data=TP53_consequences, aes(x = reorder(consequence, value), y=value, fill=num)) +
geom_bar(stat="identity", position=position_dodge()) + theme(axis.text.y = element_text(size = 8)) + coord_flip() + labs(title = "TP53 consequences", y = "Appearances", x ="Consequence")
)## Saving 8 x 5 in image
Figure 4: TP53 consequences